class: title-slide, left, bottom

# Combining a smooth information criterion with neural networks

----

## **Andrew McInerney**

### University of Limerick

#### CMStatistics 2023, 17 December 2023

---

# Background

--

<img src="data:image/png;base64,#img/crt-logo.jpg" width="60%" style="display: block; margin: auto;" />

--

* Research: Neural networks from a statistical-modelling perspective

--

<img src="data:image/png;base64,#img/packages.png" width="70%" style="display: block; margin: auto;" />

---

# Smooth Information Criterion

--

<img src="data:image/png;base64,#img/sic-publication.png" width="100%" style="display: block; margin: auto;" />

---

# Smooth Information Criterion

$$
\text{BIC} = -2\ell(\theta) + \log(n) \left[ \sum_{j=1}^p |\beta_j|^0 + 1 \right]
$$

--

where

`\begin{equation*}
\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2
\end{equation*}`

---

# Smooth Information Criterion

$$
\text{BIC} = -2\ell(\theta) + \log(n) \left[ \sum_{j=1}^p |\beta_j|^0 + 1 \right]
$$

--

Introduce the "smooth BIC":

--

$$
\text{SBIC} = -2\ell(\theta) + \log(n) \left[ \sum_{j=1}^p \frac{\beta_j^2}{\beta_j^2 + \epsilon^2} + 1 \right]
$$

---

# Smooth Information Criterion

<img src="data:image/png;base64,#img/smooth-l0.jpg" width="100%" style="display: block; margin: auto;" />

---

# `\(\epsilon\)`-telescoping

$$
\text{SBIC} = -2\ell(\theta) + \log(n) \left[ \sum_{j=1}^p \frac{\beta_j^2}{\beta_j^2 + \epsilon^2} + 1 \right]
$$

--

* The optimal `\(\epsilon\)` is zero

--

* Smaller `\(\epsilon\)` `\(\implies\)` less numerically stable

--

* Start with a larger `\(\epsilon\)`, and "telescope" through a decreasing sequence of `\(\epsilon\)` values using warm starts

---

# R Package

<img src="data:image/png;base64,#img/smoothic.png" width="100%" style="display: block; margin: auto;" />

---

# Extending to Neural Networks

`$$\mathbb{E}(y) = \text{NN}(X, \theta)$$`

--

where

`$$\text{NN}(X, \theta) = \phi_o \left[ \gamma_0 + \sum_{k=1}^q \gamma_k
\phi_h \left( \sum_{j=0}^p \omega_{jk} x_{j} \right) \right]$$`

---

# Extending to Neural Networks

<p style="font-size: 0.85em">
$$
\text{SBIC} = -2\ell(\theta) + \log(n) \left[ \sum_{jk} \frac{\omega_{jk}^2}{\omega_{jk}^2 + \epsilon^2} + \sum_{k} \frac{\gamma_k^2}{\gamma_k^2 + \epsilon^2} + q + 1 \right]
$$
</p>

--

where

<p style="font-size: 1em">
$$
\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \text{NN}(x_i))^2
$$
</p>

---

# Simulation Setup

<img src="data:image/png;base64,#img/sim-setup.png" width="50%" style="display: block; margin: auto;" />

---

# Results

<img src="data:image/png;base64,#img/sim-single-results.png" width="90%" style="display: block; margin: auto;" />

---

# Extending to Group Sparsity

--

.pull-left[
<img src="data:image/png;base64,#img/input-group.png" width="100%" style="display: block; margin: auto;" />
]

--

Single penalty:

`\begin{equation*}
\frac{\omega_{jk}^2}{\omega_{jk}^2 + \epsilon^2}
\end{equation*}`

--

Group penalty:

$$
\text{card}(\omega_j) \times \frac{||\omega_j||_2^2}{||\omega_j||_2^2 + \epsilon^2}
$$

---

class: inputgroup-slide

# Group Sparsity

## Input-neuron penalization

<p style="font-size: 0.78em">
$$
\text{IN-SBIC} = -2\ell(\theta) + \log(n) \left[ q \times \sum_{j} \frac{||\omega_j||_2^2}{||\omega_j||_2^2 + \epsilon^2} + \sum_{k} \frac{\gamma_k^2}{\gamma_k^2 + \epsilon^2} + q + 1 \right]
$$
</p>

where `\(\omega_{j} = (\omega_{j1},\omega_{j2},\dotsc,\omega_{jq})^T\)`

---

class: hiddengroup-slide

# Group Sparsity

## Hidden-neuron penalization

<p style="font-size: 0.78em">
$$
\text{HN-SBIC} = -2\ell(\theta) + \log(n) \left[ (p + 1) \times \sum_{k} \frac{||\theta^{(k)}||_2^2}{||\theta^{(k)}||_2^2 + \epsilon^2} + q + 1 \right]
$$
</p>

where `\(\theta^{(k)} = (\omega_{1k},\omega_{2k},\dotsc,\omega_{pk}, \gamma_k)^T\)`

---

# Simulation Setup

<img src="data:image/png;base64,#img/sim-setup.png" width="50%" style="display: block; margin: auto;" />

---

# Results (IN-SBIC)

<img
src="data:image/png;base64,#img/sim-input-results.png" width="90%" style="display: block; margin: auto;" />

---

# Data Application

--

### Insurance Data (Kaggle)

--

1,338 beneficiaries enrolled in an insurance plan

--

Response: `charges`

--

6 explanatory variables: `age`, `sex`, `bmi`, `children`, `smoker`, `region`

---

# Data Application - Results

<img src="data:image/png;base64,#img/insurance.png" width="100%" style="display: block; margin: auto;" />

---

# Conclusion

--

- Extended the smooth information criterion to neural networks

--

- Also extended it to allow for group sparsity

--

- Preliminary simulation and data-application results are promising

---

class: bigger

# References

* <font size="5">McInerney, A., & Burke, K. (2022). A statistically-based approach to feedforward neural network model selection. <i>arXiv preprint arXiv:2207.04248</i>.</font>

* <font size="5">McInerney, A., & Burke, K. (2023). Interpreting feedforward neural networks as statistical models. <i>arXiv preprint arXiv:2311.08139</i>.</font>

* <font size="5">O’Neill, M., & Burke, K. (2023). Variable selection using a smooth information criterion for distributional regression models. <i>Statistics and Computing</i>, 33(3), 71.</font>

### R Packages

```r
devtools::install_github(c("andrew-mcinerney/selectnn",
                           "andrew-mcinerney/interpretnn"))
```
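The `\(\epsilon\)`-telescoping procedure from earlier slides can be sketched in a few lines of base R: minimise the SBIC objective directly with `optim()`, warm-starting each fit from the previous one as `\(\epsilon\)` decreases. This is an illustrative toy on simulated data, not the `smoothic` implementation; the names (`sbic`, `beta_true`) and the `\(\epsilon\)` grid are invented for the example:

```r
# Toy illustration of epsilon-telescoping for the SBIC (linear-model case).
# The smoothic package implements the actual procedure.
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, 0, 0, 1, 0)                  # sparse truth
y <- drop(X %*% beta_true + rnorm(n))

# SBIC objective: -2 * log-likelihood + log(n) * (smooth-L0 df + 1 for sigma)
sbic <- function(par, eps) {
  beta <- par[1:p]
  sigma2 <- exp(2 * par[p + 1])                # optimise log(sigma) for positivity
  ll <- -n / 2 * log(2 * pi * sigma2) -
    sum((y - X %*% beta)^2) / (2 * sigma2)
  pen <- sum(beta^2 / (beta^2 + eps^2)) + 1
  -2 * ll + log(n) * pen
}

# Telescope through a decreasing epsilon sequence, warm-starting each fit
par <- c(rep(0.5, p), 0)
for (eps in c(1, 0.5, 0.1, 0.01, 0.001)) {
  par <- optim(par, sbic, eps = eps, method = "BFGS")$par
}
round(par[1:p], 2)  # inactive coefficients are shrunk towards zero
```

At the final (small) `\(\epsilon\)`, each term `\(\beta_j^2/(\beta_j^2+\epsilon^2)\)` is near 1 for active coefficients and near 0 for inactive ones, so the bracketed term approximates the BIC's effective number of parameters.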
<font size="5.5">andrew-mcinerney</font>
<font size="5.5">@amcinerney_</font>
<font size="5.5">andrew.mcinerney@ul.ie</font>